- From: Andras Kornai <andras@calera.com>
- Subject: Re: One more kermit question
- To: fdc@watsun.cc.columbia.edu (Frank da Cruz)
- Date: Thu, 11 Mar 93 21:33:45 PST
-
- ----------------------------------------------------------------------
- CYRILLIC ENCODING FAQ Version 1.3, March 13 1993
-
- ACKNOWLEDGEMENTS Most of the information was provided by the following:
-
- David J. Birnbaum <djbpitt+@pitt.edu>
- Frank da Cruz <fdc@watsun.cc.columbia.edu>
- Bur Davis <bdavis@adobe.com>
- George Fowler <gfowler@ucs.indiana.edu>
- Richard B. Paine <RPAINE@CCNODE.Colorado.EDU>
- Slava Paperno <PAPY@CORNELLA.cit.cornell.edu>
- Keld J. Simonsen <Keld.Simonsen@dkuug.dk>
- Glenn E. Thobe <thobe@getunx.info.com>
- Dimitri Vulis <DLV@CUNYVMS1.BITNET>
- Johan W. van Wingen <precal@rulmvs.leidenuniv.nl>
-
-
- Thanks to all who contributed -- I am responsible for the errors that
- still remain.
-
- Andras Kornai (andras@calera.com, kornai@csli.stanford.edu)
-
-
- Q: What are the commonly used computer encodings for Cyrillic?
- A: Broadly speaking, there are three kinds of schemes in use: those that
- replace Cyrillic characters by 7-bit ASCII values, those that use the
- full 8-bit range 0-255, and those using multi-byte codes. Presently
- only the first two types are in wide use, but for reference purposes I
- will also discuss the third type.
-
-
- Q: What kind of transliteration schemes are there?
- A: The most important one is called KOI-7: the Russian alphabet is given
- by the ASCII characters (note the exchange of upper and lower cases):
-
- UPPER CASE: abwgde$vzijklmnoprstufhc~{}"yx|`q
- lower case: ABWGDE#VZIJKLMNOPRSTUFHC^[]_YX\@Q
-
- The following extensions to the official standard KOI-7 are supported in
- Glenn Thobe's conversion programs for invertibility: '"'=YER, '#'=yo,
- '$'=YO, '<'=left guillemet, '>'=right guillemet.
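
The two rows above list the 33 letters of the Russian alphabet in alphabetical order, with yo in seventh position, so decoding can be done by position. What follows is only a sketch under that reading: it assumes Glenn Thobe's extensions for YER and yo/YO are in use, it ignores the guillemets, and the names koi7_to_unicode and letter_to_unicode are made up for illustration. The Unicode values are the ones listed later in this document.

```c
#include <string.h>

/* The two rows from the text: position i holds the ASCII character that
   stands for the i-th letter of the Russian alphabet (yo is at index 6). */
static const char koi7_upper[] = "abwgde$vzijklmnoprstufhc~{}\"yx|`q";
static const char koi7_lower[] = "ABWGDE#VZIJKLMNOPRSTUFHC^[]_YX\\@Q";

/* Alphabet index (0-32) to Unicode. base is 0x0410 for capitals and
   0x0430 for small letters; yo lives outside the main alphabetical run. */
static long letter_to_unicode(int idx, long base, long yo)
{
    if (idx < 6)  return base + idx;     /* a .. ie */
    if (idx == 6) return yo;             /* yo */
    return base + idx - 1;               /* zhe .. ya */
}

/* Hypothetical helper: Unicode code point for one KOI-7 character,
   or -1 if the character does not stand for a Russian letter. */
long koi7_to_unicode(int ch)
{
    const char *p;

    if (ch <= 0 || ch > 127)
        return -1;
    if ((p = strchr(koi7_upper, ch)) != NULL)
        return letter_to_unicode((int)(p - koi7_upper), 0x0410, 0x0401);
    if ((p = strchr(koi7_lower, ch)) != NULL)
        return letter_to_unicode((int)(p - koi7_lower), 0x0430, 0x0451);
    return -1;
}
```

With these rows, koi7_to_unicode('a') gives 0x0410 (capital A) and koi7_to_unicode('Q') gives 0x044f (small ja); digits and most punctuation come back as -1 and can be copied through unchanged.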
-
- A slightly different (multicharacter) scheme is employed by Steve
- Gaarder's (gaarder@theory.tc.cornell.edu) conversion code from Old
- KOI-8, included below. This particular scheme provides easy
- readability but suffers from some transliteration weirdness, such as
- mapping short ii and yeri to the same character. Since proper
- transliteration often requires context-sensitive rules, and differs
- from language to language within the same script, a fuller discussion
- is beyond the scope of the present document. For an overview of the
- major Cyrillic to Latin transliteration schemes used in the US, see pp.
- 457-460 of the Style Manual of the US Government Printing Office, for
- sale by the Superintendent of Documents, USGPO, Washington DC 20402,
- Stock Number 021-000-00120-1 (paper) or 021-000-00120-0 (hardbound).
- See also the Chicago Manual of Style, and Transliteracija russkikh
- slov latinskimi bukvami, GOST 16876-71.
-
-
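- #include <stdio.h>
-
- /* Transliterations for Old KOI-8 codes 0xc0-0xff (lower case first,
-    then upper case), indexed by code - 0xc0. */
- char transtbl[64][5] =
- {"yu", "a", "b", "ts", "d" , "e", "f", "g", "kh", "i", "y" , "k", "l",
- "m", "n", "o", "p", "ya", "r" , "s", "t", "u", "zh", "v", "'",
- "y", "z", "sh", "e", "shch", "ch", "`",
- "YU", "A", "B", "TS", "D" , "E", "F", "G", "KH", "I", "Y" , "K", "L",
- "M", "N", "O", "P", "YA", "R" , "S", "T", "U", "ZH", "V", "'",
- "Y", "Z", "SH", "E", "SHCH", "CH", "`" };
-
- int main(void)
- {
- int c;
-
- while ((c = getchar()) != EOF)
- { if ( c >= 0x80) c -= 0x80;           /* strip the high bit */
- if ( c < 0x40) putchar(c);             /* controls, digits, punctuation */
- /* 0x40-0x7f go through the table; note that this applies to plain
-    ASCII letters in the input as well as to stripped KOI-8 codes */
- else printf("%s",transtbl[c-0x40]);
- }
- return 0;
- }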
-
-
- Q: What are the eight-bit schemes?
-
- A: For the IBM mainframe world, which includes the ES (edinaja sistema)
- clones of 360-370 mainframes, the basic scheme, called DKOI-8, extends
- EBCDIC by putting the Cyrillic letters in the unused slots, mostly in
- the rectangle 0x8a to 0xff (first hex digit >=8, second digit >=a). The
- mysteries of EBCDIC/ASCII conversion go beyond the scope of this
- document, and in the table that follows I will ignore 8-bit ASCII values
- below 0xa0 and refer the reader to Dimitri Vulis' excellent document,
- which sheds some light on the IBM meaning of the characters 0x80-0x9f,
- which are reserved in both ISO 8859-1 (Latin-1) and 8859-5 (Cyrillic).
-
- /* From 8859-5 to DKOI-8. ebcdic(isoval) = isotoibm[isoval-160] */
-
- int isotoibm[96] = {
- 0x41,0xaa,0x4a,0xb1,0x9f,0xb2,0x6a,0xb5,
- 0xbd,0xb4,0x9a,0x8a,0x5f,0xca,0xaf,0xbc,
- 0x90,0x8f,0xea,0xfa,0xbe,0xa0,0xb6,0xb3,
- 0x9d,0xda,0x9b,0x8b,0xb7,0xb8,0xb9,0xab,
- 0x64,0x65,0x62,0x66,0x63,0x67,0x9e,0x68,
- 0x74,0x71,0x72,0x73,0x78,0x75,0x76,0x77,
- 0xac,0x69,0xed,0xee,0xeb,0xef,0xec,0xbf,
- 0x80,0xfd,0xfe,0xfb,0xfc,0xad,0xae,0x59,
- 0x44,0x45,0x42,0x46,0x43,0x47,0x9c,0x48,
- 0x54,0x51,0x52,0x53,0x58,0x55,0x56,0x57,
- 0x8c,0x49,0xcd,0xce,0xcb,0xcf,0xcc,0xe1,
- 0x70,0xdd,0xde,0xdb,0xdc,0x8d,0x8e,0xdf
- };
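
The comment above the table gives the indexing rule, ebcdic(isoval) = isotoibm[isoval-160]. A small range-checked wrapper might look like the sketch below. The function name iso_to_dkoi is my own, and the table is passed in as a parameter so that the fragment stands alone; in practice one would pass isotoibm.

```c
/* Hypothetical wrapper around a 96-entry table laid out like isotoibm:
   entry i is the DKOI-8 code for 8859-5 value 160 + i. */
int iso_to_dkoi(const int table[96], int isoval)
{
    if (isoval < 160 || isoval > 255)
        return -1;          /* outside the range the table covers */
    return table[isoval - 160];
}
```

Values below 160 are deliberately rejected here rather than guessed at, since the ASCII part of the EBCDIC mapping differs between variants.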
-
- There are minor variations to DKOI, called Cyrillic Extended Code Page
- 037 (most common on BITNET), CECP 500 (which is the definitive one), the
- "JNET" and the "FORTRAN" mappings. The differences between these are
- tabulated below. Notice that EBCDIC/DKOI, unlike ASCII, is not uniquely
- defined even on the 0-127 range:
-
-
- 8859-5  037   500   JNET  FORTRAN
-
- 0x21    0x5a  0x4f  0x5a  0x4f     exclamation point (bang)
- 0x5b    0xba  0x4a  0xad  0x4a     opening square bracket
- 0x5d    0xbb  0x5a  0xbd  0x5a     closing square bracket
- 0x5e    0xb0  0x5f  0x5f  0x5f     circumflex accent
- 0x7c    0x4f  0xbb  0x6a  0x4f     logical or (vertical bar)
- [a2]    0x4a  0xb0  0x43  0x43     cent sign (in 037)/capital dje (in 500)
- [ac]    0x5f  0xba  0x54  0x54     logical not (in 037)/capital kje (in 500)
- 0xd5    0xef  0xef  0xbb  0xad     small ie
- 0xe3    0x46  0x46  0x4a  0xbb     small u
- 0xe5    0x47  0x47  0xfc  0xbd     small kha
- 0xfc    0xdc  0xdc  0x6a  0xfc     small kje
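
One way to use the table of differences is as a list of per-variant overrides consulted before the base mapping. The sketch below is only an illustration: the type and function names are invented, and I am assuming that the bracketed entries [a2] and [ac] stand for 8859-5 values 0xa2 and 0xac.

```c
#include <stddef.h>

enum ebcdic_variant { CECP_037, CECP_500, JNET, FORTRAN };

struct variant_row {
    int iso;        /* 8859-5 value (first column of the table) */
    int code[4];    /* EBCDIC code per variant, in enum order */
};

/* Rows transcribed from the table in the text. */
static const struct variant_row variant_rows[] = {
    { 0x21, { 0x5a, 0x4f, 0x5a, 0x4f } },
    { 0x5b, { 0xba, 0x4a, 0xad, 0x4a } },
    { 0x5d, { 0xbb, 0x5a, 0xbd, 0x5a } },
    { 0x5e, { 0xb0, 0x5f, 0x5f, 0x5f } },
    { 0x7c, { 0x4f, 0xbb, 0x6a, 0x4f } },
    { 0xa2, { 0x4a, 0xb0, 0x43, 0x43 } },   /* [a2], assumed to be 0xa2 */
    { 0xac, { 0x5f, 0xba, 0x54, 0x54 } },   /* [ac], assumed to be 0xac */
    { 0xd5, { 0xef, 0xef, 0xbb, 0xad } },
    { 0xe3, { 0x46, 0x46, 0x4a, 0xbb } },
    { 0xe5, { 0x47, 0x47, 0xfc, 0xbd } },
    { 0xfc, { 0xdc, 0xdc, 0x6a, 0xfc } },
};

/* Hypothetical lookup: the variant-specific EBCDIC code for an 8859-5
   value, or -1 if the value is not one of the tabulated differences. */
int variant_code(int isoval, enum ebcdic_variant v)
{
    size_t i;

    for (i = 0; i < sizeof variant_rows / sizeof variant_rows[0]; i++)
        if (variant_rows[i].iso == isoval)
            return variant_rows[i].code[v];
    return -1;
}
```

A caller would fall back to the common mapping whenever variant_code returns -1.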
-
-
-
-
- For the Internet, the most important code seems to be Old KOI-8, widely used
- in the Relcom groups (but probably not a whole lot elsewhere). Old KOI-8
- (GOST 19768-74) from 1974 more or less follows Latin transliteration order
- and does not include upper-case hard sign, or letters common to other Slavic
- Cyrillic alphabets (Bulgarian, Macedonian, Serbian, Ukrainian...). In the
- 0-127 range it is identical with ASCII, and for the 192-254 region see the
- transtbl array above. Some software, including uunpack (also used in
- Sergej Ryzhkov's bml, aka Beauty Mail system for PCs), which is distributed
- by Relcom, forces upper-case hard sign to 255; others (and the standard!)
- declare this incorrect, or perhaps reserve 255 for DEL. In an earlier
- version of Andrew Hume's <andrew@research.att.com> tcs, which supports
- conversion across a wide variety of Cyrillic encodings, this was called the
- "mystery DOS Cyrillic encoding", except that his sha and shcha seem to be
- interchanged. Tcs is available for anon ftp from research.att.com as
- /dist/tcs.shar.Z. The semantics of 128-191 in Old KOI are unclear
- to me. If there is an official code page (it was suggested that Xenix users
- might have one), please post it.
-
- For the PC community, Code Page 866 seems to be quite important. This is
- what Microsoft is using in its russified version of MS-DOS. In 0-31
- ASCII control chars are replaced by a random selection of dingbats. In
- 32-126 it is identical to ASCII, and in 127 it has something that looks
- like a little house (the interpretation of such positions seems to be
- subject to much uncertainty). The Russian part (128-255) is identical to
- Brjabrin's alternativnyj variant, except for 242-251, where some of the
- accents/symbols of AV are replaced by non-Russian Cyrillic characters
- and other symbols. Unfortunately CP 866 covers only Ukrainian and
- Belorussian, with the vague suggestion that e.g. Macedonian users could
- redefine the six non-Russian Cyrillic positions. This problem is
- largely resolved in Code Page 1251, the Microsoft Cyrillic Windows 3.1
- character set (also endorsed by WordPerfect and Adobe), which contains
- all Cyrillic letters used by modern Slavic languages. CP 1251 is fully
- compatible with ASCII on 0-127 (leaves control positions undefined), has
- the Russian alphabet (in order, but without io) in 192-255, and puts the
- non-Russian Cyrillic, Russian io, and a few symbols in 128-191.
-
- Brjabrin's Alternativnyj Variant (AV) is also widely used on PCs. It
- has Russian in 128 to 175 in alphabetical order except for yo, graphics
- characters in 176 to 223, again Russian in 224-241. The same set of
- graphics characters, but not in the same order, is used in Brjabrin's
- Osnovnoj Variant: they are similar to, but not identical with, IBM
- Extended ASCII graphics chars (neither the set of shapes nor the code
- values are exactly the same). AV and OV have no non-Russian Cyrillic or
- accented characters, but four accent marks are provided: 242 (acute
- below the symbol), 243 (grave below the symbol), 244 (acute above the
- symbol), and 245 (grave above the symbol). These, as well as upper case
- and lower case yo, codes 240 and 241, are in the same position in
- Osnovnoj Variant as well. Codes 246 - 249 are arrows, pointing right,
- left, down, up, in that order. Codes 250 and 251 are, in both sets
- described by Brjabrin, the division sign and the plus/minus sign (the
- latter becomes a radical sign in 866). 252 is the Number symbol, 253 is
- a sunburst, and 254 is "end of proof". 255 is in principle unused -- in
- practice people put things there.
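
Since the Russian letters of AV sit in two alphabetical runs with yo tacked on afterwards, the letter part of the mapping to Unicode can be written as simple offsets. The sketch below covers only the letters, not the graphics characters or symbols; the function name is invented, and the Unicode values are those of the avtou table later in this document.

```c
/* Hypothetical helper: Unicode code point for a Russian letter in
   Alternativnyj Variant, or -1 if the code is not a letter position. */
long av_letter_to_unicode(int c)
{
    if (c >= 128 && c <= 175)       /* capital A .. small pe */
        return 0x0410 + (c - 128);
    if (c >= 224 && c <= 239)       /* small er .. small ja */
        return 0x0440 + (c - 224);
    if (c == 240)                   /* capital yo */
        return 0x0401;
    if (c == 241)                   /* small yo */
        return 0x0451;
    return -1;                      /* graphics, symbols, or ASCII */
}
```

The gap at 176-223 is where the graphics characters live, which is why the small letters are split into two runs.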
-
- For the academic community, the lack of accents is remedied by the
- Academic version of AV developed at Cornell, which includes upper and
- lower case acute-accented vowels, and lower case grave-accented vowels.
- These replace all but six of the graphics characters (the six that were
- retained are those that are necessary for drawing a single-line box).
- The accented vowels in this set include a grave-accented lower case yo.
- Also included are the letters with diacritics used in French, German,
- and Spanish. The complete chart and DOS/Windows software may be
- requested from Exceller Software Corp. 800-426-0444. (This is NOT a
- product endorsement -- I haven't even seen the stuff!) Cornell also
- developed an Academic version of CP1251. In this, non-Russian Slavic
- languages are not supported: their letters have been replaced by Russian
- accented vowels. These include upper and lower case acute-accented
- vowels, and lower case grave-accented vowels. Also included are upper
- and lower case grave-accented yo. The AcademicFont Cyrillic character
- set was developed by University Microcomputers, who pioneered the use of
- Slavic languages on IBM-compatible computers in the US in the
- mid-eighties. This set is included among the 11 sets in Exceller's
- product. It supports Slavic and some non-Slavic languages, but not
- accented vowels.
-
- For the Macintosh community, there is a separate code page. It is ASCII
- below 128, has the Russian capital letters in 128-159 in alphabetical order
- (as usual, io is treated separately) and the Russian lowercase letters in
- 224-254, but lower case ja is moved to 223, its place taken by the sunburst
- symbol. In the 160-222 range we find the same set of (ISO 8859-5)
- non-Russian Cyrillic characters as in CP 1251. The symbols that appear here
- are also largely the same as in 1251, but the orderings are completely
- different and a few symbols are unique to one or the other, e.g. permille
- in 1251, capital delta in the Mac encoding. While a Macintosh version
- capable of character conversion is still on the drawing board, for most
- other platforms Columbia Kermit is capable of converting between a large
- variety of Cyrillic encodings. Anon ftp to watsun.cc.columbia.edu: for
- C-Kermit 5A(188) (Unix, VMS, OS/2, Amiga etc) get file kermit/b/ckaaaa.hlp,
- read it, and take it from there. For MS-DOS Kermit 3.11, get (in binary mode)
- kermit/bin/msvibm.zip, then unzip. For IBM Mainframe Kermit 4.2 and later,
- get kermit/b/ik0*.* plus one of the following: kermit/b/ikc*.* for VM/CMS,
- kermit/b/ikt*.* for MVS, kermit/b/ikx*.* for CICS or kermit/b/ikm*.* for
- MUSIC. There is also a large collection of character-set tables under
- kermit/charsets.
-
- Finally, the most broadly accepted standard outside these communities seems
- to be GOSTSCI (GOSTCII), a term used colloquially to refer to Brjabrin's
- Osnovnoj Variant or to ISO 8859-5 (which is also ECMA 113), although these
- two are not identical when it comes to non-Russian Cyrillic. The term "New
- KOI-8" means the 1987 revision of KOI-8 (GOST 19768-87) -- all these use the
- same (alphabetical, except for yo) order as 8859-5, starting with A at 176.
- However, the non-Russian Cyrillic characters (160-175 and 240-255 in new
- KOI-8) are not part of OV; their space is taken up by some graphics chars
- described for AV above. ISO 8859-5 provides for the Cyrillic characters
- required for writing all major Slavic Cyrillic alphabets (Belorussian,
- Bulgarian, Macedonian, Serbian, Ukrainian...), but not for those alphabets
- that were devised for non-Slavic languages in the Soviet Union (Abkhazian,
- Bashkir, Chukchee, Khanty, Tajik, ....), or archaic letters.
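
Because 8859-5 keeps the same order as the Unicode Cyrillic block, with A at 176 (0xb0) corresponding to 0x0410, its upper half can be computed rather than tabulated. The following is a sketch, not an official algorithm: the function name is invented, and the four exceptions are the positions that 8859-5 gives to non-Cyrillic characters (no-break space, soft hyphen, the number sign, and the section sign).

```c
/* Hypothetical helper: Unicode code point for an ISO 8859-5 value.
   Returns -1 for the reserved range 128-159 and for out-of-range input. */
long iso8859_5_to_unicode(int c)
{
    if (c < 0 || c > 255)
        return -1;
    if (c < 128)
        return c;                   /* ASCII range is unchanged */
    if (c < 160)
        return -1;                  /* reserved (control) positions */
    switch (c) {
    case 0xa0: return 0x00a0;       /* no-break space */
    case 0xad: return 0x00ad;       /* soft hyphen */
    case 0xf0: return 0x2116;       /* number (numero) sign */
    case 0xfd: return 0x00a7;       /* section sign */
    default:   return 0x0400 + (c - 0xa0);
    }
}
```

For example, 0xa1 (capital io) comes out as 0x0401 and 0xb0 (capital A) as 0x0410.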
-
-
- Q: Is this a big mess or what?
- A: To straighten this out, it seems necessary to adopt a fixed point of
- reference, which I take to be Unicode V1.1 = ISO 10646-1.2. While in
- principle 10646 is a four-byte standard and Unicode uses 16-bit integers,
- the "Basic Multilingual Plane" of 10646 is by definition identical to the
- values assigned in Unicode 1.1, both being two-byte quantities (called UCS-2
- by ISO). The following list gives the essential part of the names of the
- Cyrillic characters and the last two hex digits of their Unicode/10646
- encoding.
-
- For reasons of space, the official Unicode/10646 names have been
- abbreviated. For a full list of names, anon ftp to unicode.org, cd to
- pub/MappingTables, and get namesall.lst (which is slightly over 200k). To
- get back the full official name from the abbreviations, always add the
- prefix CYRILLIC, unless the position is UNUSED. Further, expand CAP (SMA) to
- CAPITAL (SMALL). Finally, the word LETTER should be added after CAP/SMA,
- unless it is THOUSANDS, LIGATURE, or COMBINING. The numerical code values
- given in the second column have also been abbreviated to the last two
- digits, since the preceding two hex digits (really signifying "Cyrillic")
- are always 04 in Unicode/10646.
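
The expansion rules just given are mechanical enough to write down in code. The sketch below implements exactly the rules as stated and nothing more, so it should be read as an illustration rather than as a generator of guaranteed official names; expand_name and cyrillic_code are invented names.

```c
#include <stdio.h>
#include <string.h>

/* Expand an abbreviated name from the list into the full name.
   Returns 0 on success, -1 if the result does not fit into out. */
int expand_name(const char *abbr, char *out, size_t outsize)
{
    const char *rest = abbr;    /* what follows CAP/SMA, if present */
    const char *kase = "";
    const char *letter = "";
    int n;

    if (strcmp(abbr, "UNUSED") == 0) {
        n = snprintf(out, outsize, "UNUSED");    /* no CYRILLIC prefix */
    } else {
        if (strncmp(abbr, "CAP ", 4) == 0) {
            kase = "CAPITAL ";
            rest = abbr + 4;
        } else if (strncmp(abbr, "SMA ", 4) == 0) {
            kase = "SMALL ";
            rest = abbr + 4;
        }
        /* LETTER goes after CAPITAL/SMALL unless LIGATURE follows */
        if (rest != abbr && strncmp(rest, "LIGATURE", 8) != 0)
            letter = "LETTER ";
        n = snprintf(out, outsize, "CYRILLIC %s%s%s", kase, letter, rest);
    }
    return (n >= 0 && (size_t)n < outsize) ? 0 : -1;
}

/* Full code value from the two hex digits given in the second column. */
unsigned cyrillic_code(unsigned last_two)
{
    return 0x0400u | (last_two & 0xffu);
}
```

So "CAP IO" with code 01 becomes "CYRILLIC CAPITAL LETTER IO" at 0x0401, while "SMA LIGATURE A IE" becomes "CYRILLIC SMALL LIGATURE A IE".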
-
- The third column gives the two-character mnemonic abbreviations suggested in
- Keld Simonsen's RFC1345 where they exist, to facilitate cross-reference to
- that document (available by anon ftp e.g. from sunsite.unc.edu as
- /pub/doc/rfp/rfp1345.txt.Z). RFC1345 has tables for Serbian and Macedonian,
- as well as for other Cyrillic encodings (IBM CP 880, INIS-cyrillic =
- ISO-IR-51, ECMA-cyrillic = ISO-IR-111) whose domain of usage is unclear to
- me; its table for Old KOI seems to be in fact a New KOI table. I will add
- conversion tables for these (or for any other) encodings provided a real
- user community exists and actually generates some public domain
- machine-readable texts.
-
- UNUSED 00
- CAP IO 01 IO
- CAP DJE 02 D%
- CAP GJE 03 G%
- CAP E 04 IE
- CAP DZE 05 DS
- CAP I 06 II
- CAP YI 07 YI
- CAP JE 08 J%
- CAP LJE 09 LJ
- CAP NJE 0A NJ
- CAP TSHE 0B Ts
- CAP KJE 0C KJ
- UNUSED 0D
- CAP SHORT U 0E V%
- CAP DZHE 0F DZ
- CAP A 10 A=
- CAP BE 11 B=
- CAP VE 12 V=
- CAP GE 13 G=
- CAP DE 14 D=
- CAP IE 15 E=
- CAP ZHE 16 Z%
- CAP ZE 17 Z=
- CAP II 18 I=
- CAP SHORT II 19 J=
- CAP KA 1A K=
- CAP EL 1B L=
- CAP EM 1C M=
- CAP EN 1D N=
- CAP O 1E O=
- CAP PE 1F P=
- CAP ER 20 R=
- CAP ES 21 S=
- CAP TE 22 T=
- CAP U 23 U=
- CAP EF 24 F=
- CAP KHA 25 H=
- CAP TSE 26 C=
- CAP CHE 27 C%
- CAP SHA 28 S%
- CAP SHCHA 29 Sc
- CAP HARD SIGN 2A ="
- CAP YERI 2B Y=
- CAP SOFT SIGN 2C %"
- CAP REVERSED E 2D JE
- CAP IU 2E JU
- CAP IA 2F JA
- SMA A 30 a=
- SMA BE 31 b=
- SMA VE 32 v=
- SMA GE 33 g=
- SMA DE 34 d=
- SMA IE 35 e=
- SMA ZHE 36 z%
- SMA ZE 37 z=
- SMA II 38 i=
- SMA SHORT II 39 j=
- SMA KA 3A k=
- SMA EL 3B l=
- SMA EM 3C m=
- SMA EN 3D n=
- SMA O 3E o=
- SMA PE 3F p=
- SMA ER 40 r=
- SMA ES 41 s=
- SMA TE 42 t=
- SMA U 43 u=
- SMA EF 44 f=
- SMA KHA 45 h=
- SMA TSE 46 c=
- SMA CHE 47 c%
- SMA SHA 48 s%
- SMA SHCHA 49 sc
- SMA HARD SIGN 4A ='
- SMA YERI 4B y=
- SMA SOFT SIGN 4C %'
- SMA REVERSED E 4D je
- SMA IU 4E ju
- SMA IA 4F ja
- UNUSED 50
- SMA IO 51 io
- SMA DJE 52 d%
- SMA GJE 53 g%
- SMA E 54 ie
- SMA DZE 55 ds
- SMA I 56 ii
- SMA YI 57 yi
- SMA JE 58 j%
- SMA LJE 59 lj
- SMA NJE 5A nj
- SMA TSHE 5B ts
- SMA KJE 5C kj
- UNUSED 5D
- SMA SHORT U 5E v%
- SMA DZHE 5F dz
- CAP OMEGA 60
- SMA OMEGA 61
- CAP YAT 62 Y3
- SMA YAT 63 y3
- CAP IOTIFIED E 64
- SMA IOTIFIED E 65
- CAP LITTLE YUS 66
- SMA LITTLE YUS 67
- CAP IOTIFIED LITTLE YUS 68
- SMA IOTIFIED LITTLE YUS 69
- CAP BIG YUS 6A O3
- SMA BIG YUS 6B o3
- CAP IOTIFIED BIG YUS 6C
- SMA IOTIFIED BIG YUS 6D
- CAP KSI 6E
- SMA KSI 6F
- CAP PSI 70
- SMA PSI 71
- CAP FITA 72 F3
- SMA FITA 73 f3
- CAP IZHITSA 74 V3
- SMA IZHITSA 75 v3
- CAP IZHITSA DOUBLE GRAVE 76
- SMA IZHITSA DOUBLE GRAVE 77
- CAP UK DIGRAPH 78
- SMA UK DIGRAPH 79
- CAP ROUND OMEGA 7A
- SMA ROUND OMEGA 7B
- CAP OMEGA TITLO 7C
- SMA OMEGA TITLO 7D
- CAP OT 7E
- SMA OT 7F
- CAP KOPPA 80 C3
- SMA KOPPA 81 c3
- THOUSANDS SIGN 82
- NON-SPACING TITLO 83
- NON-SPACING PALATALIZATION 84
- NON-SPACING DASIA PNEUMATA 85
- NON-SPACING PSILI PNEUMATA 86
- UNUSED 87
- UNUSED 88
- UNUSED 89
- UNUSED 8A
- UNUSED 8B
- UNUSED 8C
- UNUSED 8D
- UNUSED 8E
- UNUSED 8F
- CAP GE WITH UPTURN 90 G3
- SMA GE WITH UPTURN 91 g3
- CAP GE BAR 92
- SMA GE BAR 93
- CAP GE HOOK 94
- SMA GE HOOK 95
- CAP ZHE WITH RIGHT DESCENDER 96
- SMA ZHE WITH RIGHT DESCENDER 97
- CAP ZE CEDILLA 98
- SMA ZE CEDILLA 99
- CAP KA WITH RIGHT DESCENDER 9A
- SMA KA WITH RIGHT DESCENDER 9B
- CAP KA VERTICAL BAR 9C
- SMA KA VERTICAL BAR 9D
- CAP KA BAR 9E
- SMA KA BAR 9F
- CAP REVERSED GE KA A0
- SMA REVERSED GE KA A1
- CAP EN WITH RIGHT DESCENDER A2
- SMA EN WITH RIGHT DESCENDER A3
- CAP EN GE A4
- SMA EN GE A5
- CAP PE HOOK A6
- SMA PE HOOK A7
- CAP O HOOK A8
- SMA O HOOK A9
- CAP ES CEDILLA AA
- SMA ES CEDILLA AB
- CAP TE WITH RIGHT DESCENDER AC
- SMA TE WITH RIGHT DESCENDER AD
- CAP STRAIGHT U AE
- SMA STRAIGHT U AF
- CAP STRAIGHT U BAR B0
- SMA STRAIGHT U BAR B1
- CAP KHA WITH RIGHT DESCENDER B2
- SMA KHA WITH RIGHT DESCENDER B3
- CAP TE TSE B4
- SMA TE TSE B5
- CAP CHE WITH RIGHT DESCENDER B6
- SMA CHE WITH RIGHT DESCENDER B7
- CAP CHE VERTICAL BAR B8
- SMA CHE VERTICAL BAR B9
- CAP H BA
- SMA H BB
- CAP IE HOOK BC
- SMA IE HOOK BD
- CAP IE HOOK OGONEK BE
- SMA IE HOOK OGONEK BF
- PALOCHKA C0
- CAP SHORT ZHE C1
- SMA SHORT ZHE C2
- CAP KA HOOK C3
- SMA KA HOOK C4
- UNUSED C5
- UNUSED C6
- CAP EN HOOK C7
- SMA EN HOOK C8
- UNUSED C9
- UNUSED CA
- CAP CHE WITH LEFT DESCENDER CB
- SMA CHE WITH LEFT DESCENDER CC
- UNUSED CD
- UNUSED CE
- UNUSED CF
- CAP A WITH BREVE D0
- SMA A WITH BREVE D1
- CAP A WITH DIAERESIS D2
- SMA A WITH DIAERESIS D3
- CAP LIGATURE A IE D4
- SMA LIGATURE A IE D5
- CAP IE WITH BREVE D6
- SMA IE WITH BREVE D7
- CAP SCHWA D8
- SMA SCHWA D9
- CAP SCHWA WITH DIAERESIS DA
- SMA SCHWA WITH DIAERESIS DB
- CAP ZHE WITH DIAERESIS DC
- SMA ZHE WITH DIAERESIS DD
- CAP ZE WITH DIAERESIS DE
- SMA ZE WITH DIAERESIS DF
- CAP ABKHASIAN DZE E0
- SMA ABKHASIAN DZE E1
- CAP I WITH MACRON E2
- SMA I WITH MACRON E3
- CAP I WITH DIAERESIS E4
- SMA I WITH DIAERESIS E5
- CAP O WITH DIAERESIS E6
- SMA O WITH DIAERESIS E7
- CAP BARRED O E8
- SMA BARRED O E9
- CAP BARRED O WITH DIAERESIS EA
- SMA BARRED O WITH DIAERESIS EB
- CAP U WITH ACUTE EC
- SMA U WITH ACUTE ED
- CAP U WITH MACRON EE
- SMA U WITH MACRON EF
- CAP U WITH DIAERESIS F0
- SMA U WITH DIAERESIS F1
- CAP U WITH DOUBLE ACUTE F2
- SMA U WITH DOUBLE ACUTE F3
- CAP CHE WITH DIAERESIS F4
- SMA CHE WITH DIAERESIS F5
- CAP DJE WITH ACUTE F6
- SMA DJE WITH ACUTE F7
- CAP YERU WITH DIAERESIS F8
- SMA YERU WITH DIAERESIS F9
- UNUSED FA
- UNUSED FB
- UNUSED FC
- UNUSED FD
- UNUSED FE
- UNUSED FF
-
-
-
- Q: Is everything clear now?
-
- A: Probably not. To ease the pain, here follow some tentative conversion
- tables *from* the 8-bit schemes described above *to* Unicode. Since the
- Unicode/10646 character set is much larger, no tables are provided in
- the other direction.
-
- In the 0-127 range everything is ASCII (except for the CP866 dingbats in
- the range 0-31 which are at any rate optional, and for EBCDIC/DKOI-8, for
- which see above), so tables are provided here only for 128-255. Notice
- that values not starting with 0x04 are often given, meaning that
- the Unicode equivalent is outside the Unicode Cyrillic range
- 0x0400-0x04ff, but included at some other place, typically among the
- arrows (0x2190-0x21ff) or other semigraphic material (0x2500-0x25ff). If
- a particular encoding leaves (by official definition, not necessarily in
- practical usage) some code unused, this is designated by "-1" in the
- conversion table. For some positions the tables show a "-2", meaning
- that I have no information on the intended meaning. (This is not the
- same as there being no Unicode codepoint for the character in question,
- a situation we potentially encounter with AV and OV 242-245, see note
- there.)
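
To make the conventions concrete, here is a sketch of how one of the 128-entry tables might be applied to a buffer of 8-bit text. The function names and the fallback parameter are my own inventions; the only things taken from the text are the layout (entry i describes code 128 + i) and the meaning of -1 and -2.

```c
#include <stddef.h>

#define CYR_UNUSED  (-1L)   /* code left unused by the encoding */
#define CYR_UNKNOWN (-2L)   /* intended meaning not known */

/* Hypothetical helper: table is a 128-entry array covering codes
   128-255; codes 0-127 are passed through as ASCII. */
long byte_to_unicode(const long table[128], int c)
{
    if (c < 0 || c > 255)
        return CYR_UNUSED;
    if (c < 128)
        return c;
    return table[c - 128];
}

/* Convert n bytes, writing one code point per byte into out.
   Positions marked -1 or -2 in the table are replaced by fallback;
   the return value counts how many such positions were seen. */
size_t convert_buffer(const long table[128], const unsigned char *in,
                      size_t n, long *out, long fallback)
{
    size_t i, bad = 0;

    for (i = 0; i < n; i++) {
        long u = byte_to_unicode(table, in[i]);
        if (u < 0) {
            u = fallback;
            bad++;
        }
        out[i] = u;
    }
    return bad;
}
```

Keeping -1 and -2 distinct in the tables, and collapsing them only at conversion time, means a caller can still tell "officially unused" from "unknown" by calling byte_to_unicode directly.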
-
-
-
- /* From old Koi-8 to Unicode */
-
- long oldkoi8tou[128] = {
- -2, -2, -2, -2, -2, -2, -2, -2,
- -2, -2, -2, -2, -2, -2, -2, -2,
- -2, -2, -2, -2, -2, -2, -2, -2,
- -2, -2, -2, -2, -2, -2, -2, -2,
- -2, -2, -2, -2, -2, -2, -2, -2,
- -2, -2, -2, -2, -2, -2, -2, -2,
- -2, -2, -2, -2, -2, -2, -2, -2,
- -2, -2, -2, -2, -2, -2, -2, -2,
- 0x044e,0x0430,0x0431,0x0446,0x0434,0x0435,0x0444,0x0433,
- 0x0445,0x0438,0x0439,0x043a,0x043b,0x043c,0x043d,0x043e,
- 0x043f,0x044f,0x0440,0x0441,0x0442,0x0443,0x0436,0x0432,
- 0x044c,0x044b,0x0437,0x0448,0x044d,0x0449,0x0447,0x044a,
- 0x042e,0x0410,0x0411,0x0426,0x0414,0x0415,0x0424,0x0413,
- 0x0425,0x0418,0x0419,0x041a,0x041b,0x041c,0x041d,0x041e,
- 0x041f,0x042f,0x0420,0x0421,0x0422,0x0423,0x0416,0x0412,
- 0x042c,0x042b,0x0417,0x0428,0x042d,0x0429,0x0427,0x042a
- };
-
-
- /* From CP866 to Unicode */
-
- long cp866tou[128] = {
- 0x0410,0x0411,0x0412,0x0413,0x0414,0x0415,0x0416,0x0417,
- 0x0418,0x0419,0x041a,0x041b,0x041c,0x041d,0x041e,0x041f,
- 0x0420,0x0421,0x0422,0x0423,0x0424,0x0425,0x0426,0x0427,
- 0x0428,0x0429,0x042a,0x042b,0x042c,0x042d,0x042e,0x042f,
- 0x0430,0x0431,0x0432,0x0433,0x0434,0x0435,0x0436,0x0437,
- 0x0438,0x0439,0x043a,0x043b,0x043c,0x043d,0x043e,0x043f,
- 0x2591,0x2592,0x2593,0x2502,0x2524,0x2561,0x2562,0x2556,
- 0x2555,0x2563,0x2551,0x2557,0x255d,0x255c,0x255b,0x2510,
- 0x2514,0x2534,0x252c,0x251c,0x2500,0x253c,0x255e,0x255f,
- 0x255a,0x2554,0x2569,0x2566,0x2560,0x2550,0x256c,0x2567,
- 0x2568,0x2564,0x2565,0x2559,0x2558,0x2552,0x2553,0x256b,
- 0x256a,0x2518,0x250c,0x2588,0x2584,0x258c,0x2590,0x2580,
- 0x0440,0x0441,0x0442,0x0443,0x0444,0x0445,0x0446,0x0447,
- 0x0448,0x0449,0x044a,0x044b,0x044c,0x044d,0x044e,0x044f,
- 0x0401,0x0451,0x0404,0x0454,0x0407,0x0457,0x040e,0x045e,
- 0x00b0,0x2022,0x00b7,0x221a,0x2116,0x00a4,0x25a0, -1
- };
-
-
- /* From CP1251 to Unicode */
-
- long cp1251tou[128] = {
- 0x0402,0x0403,0x201a,0x0453,0x201e,0x2026,0x2020,0x2021,
- -1,0x2030,0x0409,0x2039,0x040a,0x040c,0x040b,0x040f,
- 0x0452,0x2018,0x2019,0x201c,0x201d,0x2022,0x2013,0x2014,
- -1,0x2122,0x0459,0x203a,0x045a,0x045c,0x045b,0x045f,
- 0x00a0,0x040e,0x045e,0x0408,0x00a4,0x0490,0x00a6,0x00a7,
- 0x0401,0x00a9,0x0404,0x00ab,0x00ac,0x00ad,0x00ae,0x0407,
- 0x00b0,0x00b1,0x0406,0x0456,0x0491,0x00b5,0x00b6,0x00b7,
- 0x0451,0x2116,0x0454,0x00bb,0x0458,0x0405,0x0455,0x0457,
- 0x0410,0x0411,0x0412,0x0413,0x0414,0x0415,0x0416,0x0417,
- 0x0418,0x0419,0x041a,0x041b,0x041c,0x041d,0x041e,0x041f,
- 0x0420,0x0421,0x0422,0x0423,0x0424,0x0425,0x0426,0x0427,
- 0x0428,0x0429,0x042a,0x042b,0x042c,0x042d,0x042e,0x042f,
- 0x0430,0x0431,0x0432,0x0433,0x0434,0x0435,0x0436,0x0437,
- 0x0438,0x0439,0x043a,0x043b,0x043c,0x043d,0x043e,0x043f,
- 0x0440,0x0441,0x0442,0x0443,0x0444,0x0445,0x0446,0x0447,
- 0x0448,0x0449,0x044a,0x044b,0x044c,0x044d,0x044e,0x044f,
- };
-
-
- /* From Mac to Unicode */
-
- long mactou[128] = {
- 0x0410,0x0411,0x0412,0x0413,0x0414,0x0415,0x0416,0x0417,
- 0x0418,0x0419,0x041a,0x041b,0x041c,0x041d,0x041e,0x041f,
- 0x0420,0x0421,0x0422,0x0423,0x0424,0x0425,0x0426,0x0427,
- 0x0428,0x0429,0x042a,0x042b,0x042c,0x042d,0x042e,0x042f,
- 0x2020,0x00b0,0x0490,0x00a3,0x00a7,0x2022,0x00b6,0x0406,
- 0x00ae,0x00a9,0x2122,0x0402,0x0452,0x2260,0x0403,0x0453,
- 0x221e,0x00b1,0x2264,0x2265,0x0456,0x03bc,0x0491,0x0408,
- 0x0404,0x0454,0x0407,0x0457,0x0409,0x0459,0x040a,0x045a,
- 0x0458,0x0405,0x00ac,0x221a,0x0192,0x2248,0x0394,0x00ab,
- 0x00bb,0x2026,0x0020,0x040b,0x045b,0x040c,0x045c,0x0455,
- 0x2013,0x2014,0x201c,0x201d,0x2018,0x2019,0x00f7,0x201e,
- 0x040e,0x045e,0x040f,0x045f,0x2116,0x0401,0x0451,0x044f,
- 0x0430,0x0431,0x0432,0x0433,0x0434,0x0435,0x0436,0x0437,
- 0x0438,0x0439,0x043a,0x043b,0x043c,0x043d,0x043e,0x043f,
- 0x0440,0x0441,0x0442,0x0443,0x0444,0x0445,0x0446,0x0447,
- 0x0448,0x0449,0x044a,0x044b,0x044c,0x044d,0x044e,0x00a4,
- };
-
-
- /* From Alternativnyj Variant to Unicode */
-
- long avtou[128] = {
- 0x0410,0x0411,0x0412,0x0413,0x0414,0x0415,0x0416,0x0417,
- 0x0418,0x0419,0x041a,0x041b,0x041c,0x041d,0x041e,0x041f,
- 0x0420,0x0421,0x0422,0x0423,0x0424,0x0425,0x0426,0x0427,
- 0x0428,0x0429,0x042a,0x042b,0x042c,0x042d,0x042e,0x042f,
- 0x0430,0x0431,0x0432,0x0433,0x0434,0x0435,0x0436,0x0437,
- 0x0438,0x0439,0x043a,0x043b,0x043c,0x043d,0x043e,0x043f,
- 0x2591,0x2592,0x2593,0x2502,0x2524,0x2561,0x2562,0x2556,
- 0x2555,0x2563,0x2551,0x2557,0x255d,0x255c,0x255b,0x2510,
- 0x2514,0x2534,0x252c,0x251c,0x2500,0x253c,0x255e,0x255f,
- 0x255a,0x2554,0x2569,0x2566,0x2560,0x2550,0x256c,0x2567,
- 0x2568,0x2564,0x2565,0x2559,0x2558,0x2552,0x2553,0x256b,
- 0x256a,0x2518,0x250c,0x2588,0x2584,0x258c,0x2590,0x2580,
- 0x0440,0x0441,0x0442,0x0443,0x0444,0x0445,0x0446,0x0447,
- 0x0448,0x0449,0x044a,0x044b,0x044c,0x044d,0x044e,0x044f,
- 0x0401,0x0451,0x0317,0x0316,0x0301,0x0300,0x2192,0x2190,
- 0x2193,0x2191,0x00f7,0x00b1,0x2116,0x00a4,0x25a0, -1
- };
-
- /* The interpretation of the four symbols following the second
- alphabetic block in AV remains unclear. One suggestion was to treat
- these as (non-spacing) grave and acute, as appearing above upper- or
- lowercase letters, but the graphical rendering in Brjabrin's original
- article makes clear that the distinction is between acute and grave,
- above or below the letter: this is what the table now has.
-
- But the preponderance of graphical symbols in AV suggests that the
- intention was to provide facilities for character graphics, in which
- case the interpretation is simply straight lines connecting two
- adjacent midpoints of the bounding box. If the box is the unit
- square, these would run from (.5,0) to (0,.5) and to (1,.5), and from
- (.5,1) to (0,.5) and to (1,.5), in this order. (The line segments are
- of course directionless.) Such symbols are not present in Unicode --
- the closest things are 0x25de 0x25df 0x25dc 0x25dd (in this order) but
- these are curved, not straight.
-
- Whether the graphics or the accent usage is more prevalent in actual
- usage only those plugged into the Russian PC community can tell. If
- the graphics usage turns out to be prevalent, these four symbols would
- be reasonable candidates for incorporation into Unicode, perhaps at
- positions 0x25ef to 0x25f3. */
-
-
- /* From Osnovnoj Variant to Unicode */
-
- long ovtou[128] = {
- -2, -2, -2, -2, -2, -2, -2, -2,
- -2, -2, -2, -2, -2, -2, -2, -2,
- -2, -2, -2, -2, -2, -2, -2, -2,
- -2, -2, -2, -2, -2, -2, -2, -2,
- -2, -2, -2, -2, -2, -2, -2, -2,
- -2, -2, -2, -2, -2, -2, -2, -2,
- 0x0410,0x0411,0x0412,0x0413,0x0414,0x0415,0x0416,0x0417,
- 0x0418,0x0419,0x041a,0x041b,0x041c,0x041d,0x041e,0x041f,
- 0x0420,0x0421,0x0422,0x0423,0x0424,0x0425,0x0426,0x0427,
- 0x0428,0x0429,0x042a,0x042b,0x042c,0x042d,0x042e,0x042f,
- 0x0430,0x0431,0x0432,0x0433,0x0434,0x0435,0x0436,0x0437,
- 0x0438,0x0439,0x043a,0x043b,0x043c,0x043d,0x043e,0x043f,
- 0x0440,0x0441,0x0442,0x0443,0x0444,0x0445,0x0446,0x0447,
- 0x0448,0x0449,0x044a,0x044b,0x044c,0x044d,0x044e,0x044f,
- 0x0401,0x0451,0x0317,0x0316,0x0301,0x0300,0x2192,0x2190,
- 0x2193,0x2191,0x00f7,0x00b1,0x2116,0x00a4,0x25a0, -1
- };
-
- /* The same problem with the interpretation of 242-245 as in AV (these
- rows are definitely identical). The low positions of OV are probably
- identical to 176-223 in AV... */
-
-
- /* From ISO8859-5 to Unicode */
-
- long newkoi8tou[128] = {
- -1, -1, -1, -1, -1, -1, -1, -1,
- -1, -1, -1, -1, -1, -1, -1, -1,
- -1, -1, -1, -1, -1, -1, -1, -1,
- -1, -1, -1, -1, -1, -1, -1, -1,
- 0x00a0,0x0401,0x0402,0x0403,0x0404,0x0405,0x0406,0x0407,
- 0x0408,0x0409,0x040a,0x040b,0x040c,0x00ad,0x040e,0x040f,
- 0x0410,0x0411,0x0412,0x0413,0x0414,0x0415,0x0416,0x0417,
- 0x0418,0x0419,0x041a,0x041b,0x041c,0x041d,0x041e,0x041f,
- 0x0420,0x0421,0x0422,0x0423,0x0424,0x0425,0x0426,0x0427,
- 0x0428,0x0429,0x042a,0x042b,0x042c,0x042d,0x042e,0x042f,
- 0x0430,0x0431,0x0432,0x0433,0x0434,0x0435,0x0436,0x0437,
- 0x0438,0x0439,0x043a,0x043b,0x043c,0x043d,0x043e,0x043f,
- 0x0440,0x0441,0x0442,0x0443,0x0444,0x0445,0x0446,0x0447,
- 0x0448,0x0449,0x044a,0x044b,0x044c,0x044d,0x044e,0x044f,
- 0x2116,0x0451,0x0452,0x0453,0x0454,0x0455,0x0456,0x0457,
- 0x0458,0x0459,0x045a,0x045b,0x045c,0x00a7,0x045e,0x045f
- };
-
- /* Use newkoi8tou in combination with isotoibm to derive the unicode
- meaning of the Cyrillic range in the DKOI extension of EBCDIC. If
- someone has DKOI-8 text available, I'd love to actually try... */
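
The combination suggested in the comment above amounts to inverting isotoibm and composing it with newkoi8tou. A sketch follows, untested against real DKOI-8 text: the function name is invented, and the two tables are passed in as parameters so that the fragment compiles on its own.

```c
/* Hypothetical helper. iso_to_ibm has the layout of isotoibm
   (index = 8859-5 value - 160) and iso_to_uni has the layout of
   newkoi8tou (index = 8859-5 value - 128). The result is indexed
   directly by the EBCDIC/DKOI-8 code. */
void build_dkoi_to_unicode(const int iso_to_ibm[96],
                           const long iso_to_uni[128],
                           long dkoi_to_uni[256])
{
    int i;

    for (i = 0; i < 256; i++)
        dkoi_to_uni[i] = -2;            /* no information yet */

    for (i = 0; i < 96; i++) {
        int ebcdic = iso_to_ibm[i];
        /* 8859-5 value 160 + i sits at index i + 32 of iso_to_uni */
        if (ebcdic >= 0 && ebcdic < 256)
            dkoi_to_uni[ebcdic] = iso_to_uni[i + 32];
    }
}
```

Only the 96 positions covered by isotoibm get filled in; the EBCDIC codes corresponding to the ASCII range stay at -2 and would need a separate EBCDIC/ASCII table for whichever variant is in use.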
-
-
-